Abstract: This paper presents a novel approach for combining
optical flow into enhanced 3D motion vector fields for human
action recognition. Our approach detects motion of the actors
by computing optical flow in video data captured by a multi-view
camera setup with an arbitrary number of views. Optical
flow is estimated in each view and extended to 3D using 3D 
reconstructions of the actors and pixel-to-vertex correspondences. 
The resulting 3D optical flow for each view is combined into a 
3D motion vector field by taking the significance of local motion 
and its reliability into account. 3D Motion Context (3D-MC) 
and Harmonic Motion Context (HMC) are used to represent 
the extracted 3D motion vector fields efficiently and in a view- 
invariant manner, while considering difference in anthropometry 
of the actors and their movement style variations. The resulting 
3D-MC and HMC descriptors are classified into a set of human 
actions using normalized correlation, taking into account the 
performing speed variations of different actors. We compare the 
performance of the 3D-MC and HMC descriptors, and show 
promising experimental results for the publicly available i3DPost 
Multi-View Human Action Dataset. 

Index Terms: human action recognition; multi-view; 3D optical
flow; 3D motion description

I. Introduction 

In this paper we address the problem of 3D human action 
recognition for multi-view camera systems. While 2D human 
action recognition has received high interest during the last 
decade, 3D human action recognition is still a quite unexplored 
field. Relatively few authors have so far reported work on 3D 
human action recognition [1], [2], [3]. We contribute to this 
field by introducing a novel 3D action recognition approach 
for multi-view camera systems. 

Multi-View Camera Systems. A 3D representation is more 
informative than the analysis of 2D activities carried out in 
the image plane, which is only a projection of the actual 
actions. As a result, the projection of the actions will depend 
on the viewpoint, and not contain full information about the 
performed activities. To overcome this shortcoming the use 
of 3D data has been introduced through the use of two or 
more cameras [4], [5], [6]. In this way the surface structure 
or a 3D volume of the person can be reconstructed, e.g., by 
Shape-From-Silhouette (SFS) techniques [7], and thereby a 
more descriptive representation for action recognition can be 
established. 

View-Invariant Feature Description. The use of 3D data
allows for efficient analysis of 3D human activities. However,
we are still faced with the problem that the orientation of
the subject in the 3D space should be known. Therefore,
approaches have been proposed without this assumption by
introducing view-invariant or view-independent representations.
One line of work concentrates solely on the image
data acquired by multiple cameras [8], [9], [10]. In the work
of Souvenir et al. [10], the data acquired from the
5 calibrated and synchronized cameras used to produce the
INRIA Xmas Motion Acquisition Sequences (IXMAS) Multi-View
Human Action Dataset [6] is further projected to 64
evenly spaced virtual cameras used for training. Actions are
described in a view-invariant manner by computing R transform
surfaces of silhouettes and manifold learning. Gkalelis
et al. [8] exploit the circular shift invariance property of the
discrete Fourier transform (DFT) magnitudes, and use Fuzzy
Vector Quantization (FVQ) and Linear Discriminant Analysis
(LDA) to represent and classify actions. For additional related
work on view-invariant approaches please refer to the recent
survey by Ji et al. [9].

3D Feature Descriptors. Another line of work utilizes the
full reconstructed 3D data for feature extraction and description
[11], [12], [13], [14], [15]. Johnson and Hebert proposed
the spin image [12], and Osada et al. the shape distribution
[15]. Ankerst et al. introduced the shape histogram [11],
which is similar to the 3D extended shape context [16]
presented by Körtgen et al. [14], and Kazhdan et al. applied
spherical harmonics to represent the shape histogram in a
view-invariant manner [13]. Later Huang et al. extended the
shape histogram with color information [17]. Recently, Huang
et al. made a comparison of these shape descriptors combined
with self similarities, with the shape histogram (3D shape
context) as the top performing descriptor [18].

Spatio-Temporal Descriptors. A common characteristic of 
all these approaches is that they are solely based on static 
features, like shape and pose description, while the most 
popular and best performing 2D image descriptors apply 
motion information or a combination of the two [19], [20], 
[21], [22], [23]. Some authors add temporal information by 
capturing the evolvement of static descriptors over time, i.e., 
shape and pose changes [4], [24], [25], [6], [26]. The common 
trends are to accumulate static descriptors over time, track 
Fig. 1. A schematic overview of the system structure and data flow pipeline of our approach.


human shape or pose information, or apply sliding windows 
to capture the temporal contents [1], [25], [6], [3]. Cohen et 
al. [4] use 3D human body shapes and Support Vector Machines
(SVM) for view-invariant identification of human body
postures. They apply a cylindrical histogram and compute an 
invariant measure of the distribution of reconstructed voxels, 
which later was used by Pierobon et al. [25] for human action 
recognition. 

The Motion History Volume (MHV) was proposed by
Weinland et al. [6] as a 3D extension of Motion History
Images (MHIs). MHVs are created by accumulating static
human postures over time in a cylindrical representation,
which is made view-invariant with respect to the vertical axis
by applying the Fourier transform in cylindrical coordinates.
Later, Weinland et al. [26] proposed a framework where
actions are modeled using 3D occupancy grids, built from
multiple viewpoints, in an exemplar-based Hidden Markov
Model (HMM). Learned 3D exemplars are used to produce
2D image information which is compared to the observations;
hence, 3D reconstruction is not required during the recognition
phase. Recently, Huang et al. proposed 3D shape matching in 
temporal sequences by time filtering and shape flows [18]. 
Kilner et al. [24] applied the shape histogram and evaluated 
similarity measures for action matching and key-pose detection 
in sports events, using 3D data available in the multi-camera 
broadcast environment. 

3D Motion Descriptors. To the best of our knowledge, 
the only 3D descriptors which are directly based on motion 
information are the 3D Motion Context (3D-MC) [27] and the 
Harmonic Motion Context (HMC) [27] proposed by Holte et 
al. The 3D-MC descriptor is a motion-oriented 3D version
of the shape context [16], [14], which incorporates motion
information implicitly by representing estimated 3D optical 
flow by embedded Histograms of 3D Optical Flow (3D-HOF) 
in a spherical histogram. The HMC descriptor is an extended 
version of the 3D-MC descriptor that makes it view-invariant 
by decomposing the representation into a set of spherical 
harmonic basis functions. 

Our Approach and Contributions. In this work we perform
3D human action recognition using video data acquired by
multi-view camera systems and reconstructed 3D mesh models.
A schematic overview of our approach is illustrated in
Figure 1. The contributions of this paper are threefold: (1)
We detect motion by computing optical flow in 2D multi-frames,
and extend it to 3D flow by estimating pixel-to-vertex
correspondences. The resulting 3D optical flow for each
view is combined into 3D motion vector fields by taking the
significance of local motion and its reliability into account.
(2) We apply the 3D Motion Context (3D-MC) and the
view-invariant Harmonic Motion Context (HMC) descriptors
proposed by Holte et al. [27] to represent the extracted 3D
motion vector fields efficiently. The resulting 3D-MC and
HMC descriptors are classified into a set of human actions
using normalized correlation, which incorporates robustness to
performing speed variations of different actors. (3) In contrast
to the work reported in [27], where only limited experiments
are conducted for a small-scale human action dataset acquired
by a Time-of-Flight sensor, we evaluate our proposed approach
on the recently produced and publicly available i3DPost
Multi-View Human Action Dataset [5]. Furthermore, we compare
the performance of the 3D-MC and HMC descriptors for a
variable number of actions and camera views used for training
and testing of the system, and show promising experimental
results for both descriptors within an accuracy range of
76-100%. To the best of our knowledge, we are the first to extract
rich 3D motion in the form of motion vector fields and apply
3D motion description for multi-view data.

Paper Structure. The remainder of the paper is organized as
follows. In Section II we present our technique for multi-view
motion detection, and describe how the estimated 2D motion
is extended to 3D and combined into motion vector fields.
Section III outlines the 3D-MC and HMC 3D motion descriptors,
and Section IV describes the action classification applied
for action recognition. Experimental results and comparisons
are reported in Section V, followed by concluding remarks
in Section VI.


3D Optical Flow by Pixel-to-Vertex Correspondences. For
each pixel in the multi-frames we transform the temporal pixel
correspondences into temporal 3D vertex correspondences
$(\mathbf{p}_i^{t}, \mathbf{p}_i^{t-1})$, which can be used to compute 3D velocities
$\mathbf{v}_{3D} = (v_x, v_y, v_z)^T = \mathbf{p}_i^{t} - \mathbf{p}_i^{t-1}$. For this purpose we use the
camera calibration data for the multi-view camera system [5],
and project the vertices $\mathbf{p}$ of reconstructed 3D mesh models [7]
onto the respective image planes with coordinates $(u, v)$, using
the following set of equations:
II. Multi-View Motion Detection

We detect motion in multi-frames $T = (I_1, I_2, \ldots, I_n)$ using
a 3D version of optical flow to produce velocity annotated
point clouds [28], [29], [30] (3D optical flow). Afterwards we
combine the estimated 3D optical flow for each view into a 3D
motion vector field by taking the significance of local motion
and its reliability into account (see Figure 1).

Optical Flow Estimation in Multi-Frames. Optical flow is 
the pattern of apparent motion in a visual scene caused by the 
relative motion between an observer and the scene. The main 
benefit of optical flow compared to other motion detection 
techniques, like double differencing [31], is that optical flow 
determines both the amount of motion and its direction in form 
of velocity vectors. The technique computes the optical flow 
of each image pixel as the distribution of apparent velocity of 
moving brightness patterns in an image. The flow of a constant 
brightness profile can be described by the constant velocity 
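vector $\mathbf{v}_{2D} = (v_x, v_y)^T$ as outlined in Equation 1.

$$I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t) = I(x + v_x \cdot \delta t, y + v_y \cdot \delta t, t + \delta t) \qquad (1)$$

$$\Rightarrow \frac{\partial I}{\partial x} \cdot v_x + \frac{\partial I}{\partial y} \cdot v_y = -\frac{\partial I}{\partial t}$$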

Usually, the estimation of optical flow is based on differential
methods. They can be classified into global strategies, which
attempt to minimize a global energy functional [32], and local
methods, which optimize some local energy-like expression. A
prominent local optical flow algorithm developed by Lucas
and Kanade [33], which has proven to be among the top performing
algorithms [34], uses the spatial intensity gradient of
an image to find matching candidates using a type of
Newton-Raphson iteration. They assume the optical flow to be constant
within a certain neighborhood, which allows the
optical flow constraint equation (Equation 1) to be solved via
least-squares minimization. Optical flow is computed for each multi-frame
$T_i$ of a multi-view sequence of images $(T_1, T_2, \ldots, T_m)$ and
based on data from two consecutive multi-frames $(T_i, T_{i-1})$.

Each pixel of multi-frame $T_i$ is annotated with a 2D velocity
vector $\mathbf{v}_{2D} = (v_x, v_y)^T$ (see Figure 1), resulting in temporal
pixel correspondences between multi-frames $T_i$ and $T_{i-1}$.
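
The local least-squares step can be illustrated with a minimal NumPy sketch. It is not the pyramidal, iterative implementation used in practice: it solves the constraint of Equation 1 once over a single square window. The function name `lucas_kanade_window` and the window half-size `half` are hypothetical choices made for this illustration.

```python
import numpy as np

def lucas_kanade_window(prev, curr, y, x, half=2):
    """Estimate (v_x, v_y) at pixel (y, x) from two grayscale frames.

    Single-step sketch: the flow is assumed constant inside the
    (2*half+1) x (2*half+1) window centred on (y, x).
    """
    prev = prev.astype(float)
    curr = curr.astype(float)
    # Spatial gradients of the previous frame (axis 0 = y, axis 1 = x)
    # and the temporal difference between the two frames.
    Iy, Ix = np.gradient(prev)
    It = curr - prev
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    # Stack one constraint Ix*v_x + Iy*v_y = -It per window pixel.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    # Least-squares solution of the over-determined system A v = b.
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v
```

Applied to every pixel of every view, such a routine yields the per-pixel 2D velocity annotation described above; windows with nearly parallel gradients give an ill-conditioned system, which is what the reliability measure introduced later is meant to detect.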


where $\mathbf{R}$ and $\mathbf{t}$ are the camera rotation matrix and translation
vector; $f_x$ and $f_y$ are the x and y components of the focal
length $f$; $c_x$ and $c_y$ are the x and y components of the principal
point $c$; and $k_1$ is the coefficient of a first order distortion
model for the $i$th camera, respectively. Since multiple vertices
might be projected onto the same image pixel, we create a
z-buffer containing the depth ordered vertices $\mathbf{p}_{d,j}$, and select
the vertex with the shortest distance to the respective camera.
The distance $d$ is determined with respect to the centre of
projection $\mathbf{o}$, as follows:

$$\text{z-buffer} = [\mathbf{p}_{d,1}, \mathbf{p}_{d,2}, \ldots, \mathbf{p}_{d,n}] \qquad (3)$$

$$d = \lVert \mathbf{p}_j - \mathbf{o}_i \rVert, \quad \text{where } \mathbf{o}_i = -\mathbf{R}_i^T \mathbf{t}_i$$


This has proven to work well for selecting the best corresponding
vertices in case of multiple instances. Figure 2.a and 2.d
present examples of estimated 3D optical flow. However,
some amount of noise due to erroneous reconstructed 3D data
or falsified pixel-to-vertex correspondences, resulting from
imprecise optical flow estimation, is still present in the 3D
optical flow. These corrupted velocity vectors are eliminated to
some extent by simple filtering and thresholding, and handled
in the following by the proposed multi-flow fusion scheme
combining the 3D flow computed in multi-views into one
resulting motion vector field.
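
The pixel-to-vertex lookup can be sketched as follows. Because the projection equations are not reproduced in this text, the sketch assumes a standard pinhole camera with a single radial distortion coefficient `k1`; that model, and the function names `project_vertices`, `nearest_vertex_per_pixel` and `flow_3d`, are assumptions made for illustration only.

```python
import numpy as np

def project_vertices(P, R, t, fx, fy, cx, cy, k1=0.0):
    """Project Nx3 world vertices to pixel coordinates (u, v).

    Assumed model: pinhole projection with first-order radial distortion.
    """
    Pc = P @ R.T + t                          # world -> camera coordinates
    x = Pc[:, 0] / Pc[:, 2]
    y = Pc[:, 1] / Pc[:, 2]
    d = 1.0 + k1 * (x ** 2 + y ** 2)          # radial distortion factor
    return np.stack([fx * x * d + cx, fy * y * d + cy], axis=1)

def nearest_vertex_per_pixel(P, uv, R, t):
    """Map each (rounded) pixel to the index of its closest vertex."""
    o = -R.T @ t                              # centre of projection
    dist = np.linalg.norm(P - o, axis=1)
    pix = np.rint(uv).astype(int)
    best = {}
    # Visit vertices from far to near so the nearest one is written last.
    for idx in np.argsort(-dist):
        best[(pix[idx, 0], pix[idx, 1])] = idx
    return best

def flow_3d(P_t, P_prev, idx_t, idx_prev):
    """3D velocity between two corresponding vertices of consecutive frames."""
    return P_t[idx_t] - P_prev[idx_prev]
```

With the lookup built for frames t and t-1, each 2D flow vector links a pixel in one frame to a pixel in the other, and `flow_3d` turns the two associated vertices into a 3D velocity.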

Motion Vector Fields. The 3D optical flow for each view $V_i$
is combined into a resulting 3D motion vector field $V_{res}$. This
could be done by simple averaging over the flow components
for each view, $V_{mean}$ (see Figure 2.b and 2.e). However, instead
we weight each component by the significance of local motion
$S_i$ and the reliability of the estimated optical flow $R_i$, as given
by Equation 4:

where $n$ is the number of camera views, and $\alpha$ and $\beta$ are weights
of the two measurements, such that $\alpha + \beta = 1$ (we set
$\alpha = 0.75$ and $\beta = 0.25$). Since we focus on motion vectors,
we are interested in robust and significant motion. Therefore,
we apply a weight $S = \sqrt{v_{2D,x}^2 + v_{2D,y}^2}$ to each of the
velocity components $(v_x, v_y, v_z)$ falling within the region of






Fig. 2. Examples of single view 3D optical flow (a) and (d), “mean 3D optical flow” (b) and (e), and motion vector fields (c) and (f). 



Fig. 3. Projected silhouettes of the 3D mesh models onto the respective 
image planes for 8 camera views. 


interest, determined by the projected silhouettes of the 3D
mesh models onto the respective image planes (see Figure 3).
In this way we give emphasis to the velocity components
based on the total length of the estimated 2D optical flow
vector, i.e., the significance of local motions. This has proven
to be an important asset, reducing the impact of erroneous 3D
motion vectors when falsified pixel-to-vertex correspondences
have been established. The reliability $R$ is a measure of the
“cornerness” of the gradients in the window used to estimate
optical flow, and is determined by the smallest eigenvalue
$R = \lambda_2$ of the second moment matrix.
In this way we check for ill-conditioned second moment
matrices, and give emphasis to flow components based on
their reliability. This weighting and combination scheme has
been shown to be a robust solution, resulting in more consistent
and homogeneous vector fields, with fewer outliers and fewer
erroneous motion vectors. Figure 2.c and 2.f show examples
of the resulting motion vector fields.
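
A rough sketch of the two measures and of one possible fusion rule is given below. The significance and reliability functions follow the definitions in the text. The fusion itself is only one plausible reading, since the exact form of Equation 4 is not reproduced here: each view's flow is weighted by a convex combination of its normalized significance and reliability. The function names and the per-vertex normalization are assumptions.

```python
import numpy as np

def significance(v2d):
    """S: length of the 2D optical flow vector(s), input shape (..., 2)."""
    return np.linalg.norm(v2d, axis=-1)

def reliability(Ix, Iy):
    """R: smallest eigenvalue of the second moment matrix of a window."""
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    return np.linalg.eigvalsh(G)[0]           # eigenvalues come in ascending order

def fuse_views(V, S, R, alpha=0.75, beta=0.25, eps=1e-12):
    """Assumed fusion rule, not necessarily the paper's Equation 4.

    V: (n_views, n_vertices, 3) 3D flow per view.
    S, R: (n_views, n_vertices) significance and reliability per view.
    """
    S_n = S / (S.sum(axis=0, keepdims=True) + eps)   # normalize over views
    R_n = R / (R.sum(axis=0, keepdims=True) + eps)
    w = alpha * S_n + beta * R_n
    return (w[..., None] * V).sum(axis=0) / (w.sum(axis=0)[:, None] + eps)
```

Under this reading, a view that sees little 2D motion at a vertex, or whose gradient window is close to degenerate, contributes little to the fused vector, whereas plain averaging would let it pull the result toward zero.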


III. 3D Motion Descriptors 

The extracted 3D motion in the form of motion vector fields
is represented efficiently using 3D Motion Context (3D-MC),
and transformed into a view-invariant Harmonic Motion
Context (HMC) representation using spherical harmonics. In
the following we give a short description of the two descriptors
introduced by Holte et al. [27].

3D Motion Context. The 3D-MC is a motion-oriented 3D
version of shape context [16], [14]. It is based on a spherical
histogram, which is centered in a reference point and divided
linearly into $S$ azimuthal (east-west) bins and $T$ colatitudinal
(north-south) bins, while the radial direction is divided into
$U$ bins (see Figure 1). The 3D-MC extends the regular shape
context to represent the motion vector fields by using
the location of motion together with the amount of motion
and its direction. For each bin of the spherical histogram, the
motion vector of each vertex falling within that particular bin
is accumulated into an embedded Histogram of 3D Optical
Flow (3D-HOF). The 3D-HOF representation is divided into
$s$ azimuthal (east-west) orientation bins and $t$ colatitudinal
(north-south) bins, where each bin is weighted by the length
of the velocity vectors falling within the bin. This results in an
$S \times T \times U \times s \times t$ dimensional feature vector for each frame.
Partial invariance to the velocity of movements, as
in the case where two individuals perform the same action
at different speeds, is imposed by thresholding and normalizing the feature
vector. Hence, the descriptor gives greater emphasis to the
location and orientation, while reducing the influence of large
velocity values.
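
The binning can be sketched in NumPy as follows. The sketch makes several assumptions that the text does not fix: the vertical axis is taken to be z, vertices beyond the outermost shell are clamped into it, and the thresholding and normalization are realized as an optional clip followed by division by the total mass. The default bin counts and the 25 cm radial step are the values reported for the experiments; the function names are hypothetical.

```python
import numpy as np

def angular_bins(vec, n_az, n_col):
    """Azimuth bin, colatitude bin and length for each row of an Nx3 array."""
    r = np.linalg.norm(vec, axis=1)
    az = np.mod(np.arctan2(vec[:, 1], vec[:, 0]), 2 * np.pi)
    col = np.arccos(np.clip(vec[:, 2] / np.maximum(r, 1e-12), -1.0, 1.0))
    a = np.minimum((az / (2 * np.pi) * n_az).astype(int), n_az - 1)
    c = np.minimum((col / np.pi * n_col).astype(int), n_col - 1)
    return a, c, r

def motion_context(P, V, center, S=12, T=6, U=4, s=8, t=4, step=0.25, clip=None):
    """3D-MC sketch: spherical histogram with embedded 3D-HOF per bin.

    P: Nx3 vertex positions, V: Nx3 motion vectors, center: reference point.
    """
    pa, pc, pr = angular_bins(P - center, S, T)      # where the motion is
    pu = np.minimum((pr / step).astype(int), U - 1)  # radial shell (clamped)
    va, vc, speed = angular_bins(V, s, t)            # direction and amount
    H = np.zeros((S, T, U, s, t))
    np.add.at(H, (pa, pc, pu, va, vc), speed)        # weight by vector length
    if clip is not None:
        H = np.minimum(H, clip)                      # assumed thresholding
    total = H.sum()
    return H / total if total > 0 else H
```

With the default bin counts the descriptor has 12 x 6 x 4 x 8 x 4 = 9216 entries per frame.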

Harmonic Motion Context. The 3D-MC descriptor is made
view-invariant with respect to the vertical axis by decomposing
the spherical representation $f(\theta, \phi)$ into a weighted sum of
spherical harmonics:

$$f(\theta, \phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} A_l^m Y_l^m(\theta, \phi) \qquad (6)$$

where the term $A_l^m$ is the weighting coefficient of degree $l$
and order $m$, while the complex functions $Y_l^m(\cdot)$ are the actual
spherical harmonic functions of degree $l$ and order $m$. $\theta$ and
$\phi$ are the colatitudinal and azimuthal angle, respectively. In
Figure 1 some examples of spherical harmonic basis functions
are illustrated. The complex function $Y_l^m(\cdot)$ is given by
Equation 7.

$$Y_l^m(\theta, \phi) = K_l^m P_l^{|m|}(\cos\theta) e^{jm\phi} \qquad (7)$$

The term $K_l^m$ is a normalization constant, while the function
$P_l^{|m|}(\cdot)$ is the associated Legendre polynomial. The key feature
to note from Equation 7 is the encoding of the azimuthal
variable $\phi$, which solely affects the phase of the spherical
harmonic function and has no effect on the magnitude. This
effectively means that $\lVert A_l^m \rVert$, i.e., the norm of the decomposition
coefficients of Equation 6, is invariant to parameterization
in the variable $\phi$.
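
This phase-only dependence on the azimuth can be checked numerically. The sketch below does not compute a full spherical harmonic transform. It isolates the $e^{jm\phi}$ factor, which amounts to a discrete Fourier transform along the azimuthal axis, and shows that rotating a spherical histogram about the vertical axis changes the phases but not the magnitudes. The function name and the grid size (6 colatitude rows by 12 azimuth columns) are chosen for illustration.

```python
import numpy as np

def azimuthal_spectrum(f):
    """DFT along the azimuth axis (axis 1) of a colatitude x azimuth grid."""
    return np.fft.fft(f, axis=1)

rng = np.random.default_rng(0)
f = rng.random((6, 12))            # T colatitude rows x S azimuth columns
f_rot = np.roll(f, 3, axis=1)      # rotation by 3 azimuth bins (90 degrees)

F = azimuthal_spectrum(f)
F_rot = azimuthal_spectrum(f_rot)

# Magnitudes agree for every order m; phases do not.
same_magnitude = np.allclose(np.abs(F), np.abs(F_rot))
same_phase = np.allclose(np.angle(F), np.angle(F_rot))
```

Since the Legendre factor in Equation 7 depends only on the colatitude, the same phase shift applies to every colatitude row, so the magnitude of each coefficient is unchanged as well.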

The actual determination of the spherical harmonic coefficients
is based on an inverse summation as given by
Equation 8, where $N$ is the number of samples ($S \times T$), and
$4\pi/N$ is the surface area of each sample on the unit sphere.

$$(A_l^m)_{f_u} = \frac{4\pi}{N} \sum_{\phi=0}^{2\pi} \sum_{\theta=0}^{\pi} f_u(\theta, \phi) Y_l^m(\theta, \phi) \qquad (8)$$

In a practical application it is not necessary (or possible, as
there are infinitely many) to keep all coefficients $A_l^m$. Instead,
it is assumed that the functions $f_u$ are band-limited; hence it is only
necessary to keep coefficients up to some bandwidth $l = B$.
Given the $U$ different spherical shells, the dimensionality
becomes $D = U(B + 1)(B + 2)/2$. However, since each
bin of the spherical motion context representation consists
of an embedded spherical function in the form of a 3D-HOF
representation, each of the inner 3D-HOF representations is
first transformed up to some bandwidth $B_1$, and thereafter the
entire motion context is transformed up to some bandwidth
$B_2$. Hence, the resulting dimensionality $D$, composed of each
transformed 3D-HOF representation $D_1$ and the transformed
motion context $D_2$, becomes:

$$D = D_1 D_2 = U(B_1 + 1)(B_1 + 2)(B_2 + 1)(B_2 + 2)/4 \qquad (9)$$

Concretely, we set $U = 4$, $B_1 = 4$ and $B_2 = 5$, resulting in
4 x 315 coefficients.
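
As a quick arithmetic check of Equation 9, the count can be reproduced in a few lines. The function name `hmc_dimension` is introduced here for illustration only.

```python
def hmc_dimension(U, B1, B2):
    """Number of HMC coefficients according to Equation 9."""
    d1 = (B1 + 1) * (B1 + 2) // 2   # coefficients per transformed 3D-HOF
    d2 = (B2 + 1) * (B2 + 2) // 2   # coefficients per transformed shell
    return U * d1 * d2

# U = 4, B1 = 4, B2 = 5 gives 4 * 15 * 21 = 4 * 315 = 1260 coefficients.
dimension = hmc_dimension(4, 4, 5)
```

For comparison, the untransformed 3D-MC descriptor with S = 12, T = 6, U = 4, s = 8 and t = 4 has 9216 entries, so the chosen bandwidths reduce the dimensionality by roughly a factor of seven.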

The spherical motion context histogram is centered in a
reference point, which is estimated as the center of gravity of
the human body, and the radial division into $U$ bins is made
in steps of 25 cm. Furthermore, we set $S = 12$, $T = 6$, $s = 8$
and $t = 4$, which has been shown to produce good results in [27].

IV. Action Classification 

The classification of 3D human actions is carried out by
matching the current descriptor with a known set of trained
descriptors for each action class. First, the motion descriptors
are accumulated over time (the video frames of the multi-view
action sequences) to represent entire actions. However,
since action sequences are of variable length, and actors have
individual action performing speed variations, the accumulated
representations have to be normalized. We normalize the
accumulated descriptors implicitly in the classification by applying
normalized correlation.

The actual comparison of two descriptors (for both 3D-MC
and HMC) is performed by computing the normalized
correlation coefficient $C$, as given by Equation 10. To this
end each descriptor is represented as a vector $\mathbf{h}_1$ and $\mathbf{h}_2$ of
length $n$ containing the values of the 3D-MC spherical bins
(including the embedded orientation bins), or the (stacked)
spherical harmonic coefficients for the HMC descriptor:

$$C(\mathbf{h}_1, \mathbf{h}_2) = \frac{n \sum h_1 h_2 - \sum h_1 \sum h_2}{\sqrt{\left[ n \sum (h_1)^2 - \left( \sum h_1 \right)^2 \right] \left[ n \sum (h_2)^2 - \left( \sum h_2 \right)^2 \right]}} \qquad (10)$$

We make the 3D-MC descriptor view-independent by vertical
rotation of the representation: we compute a set
of normalized correlation coefficients for a discrete number



Fig. 4. Image and 3D mesh model examples for the 10 actions from the 
i3DPost Multi-View Human Action Dataset. 

of angular rotations, and select the highest matching score. 
The system is trained by generating a representative set of 
descriptors for each action class. A reference descriptor is then 
estimated as the average of all these descriptors for each class. 
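
The matching step can be sketched as below. `normalized_correlation` implements Equation 10 directly. The rotation search is simplified: it only shifts the outer azimuthal axis (assumed to be axis 0 of the descriptor array), whereas a complete vertical rotation of a 3D-MC descriptor would also have to rotate the embedded orientation histograms. All function names are hypothetical.

```python
import numpy as np

def normalized_correlation(h1, h2):
    """Normalized correlation coefficient C of Equation 10."""
    h1 = np.ravel(h1).astype(float)
    h2 = np.ravel(h2).astype(float)
    n = h1.size
    num = n * np.dot(h1, h2) - h1.sum() * h2.sum()
    den = np.sqrt((n * np.dot(h1, h1) - h1.sum() ** 2) *
                  (n * np.dot(h2, h2) - h2.sum() ** 2))
    return num / den if den > 0 else 0.0

def best_rotation_score(H, H_ref):
    """Highest score over all discrete azimuthal shifts (axis 0 = azimuth)."""
    return max(normalized_correlation(np.roll(H, k, axis=0), H_ref)
               for k in range(H.shape[0]))

def classify(H, references):
    """references: dict mapping action name -> averaged reference descriptor."""
    return max(references, key=lambda a: best_rotation_score(H, references[a]))
```

For HMC descriptors the rotation search is unnecessary, and `normalized_correlation` would be applied once per reference descriptor.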

V. Experimental Results 

To test our proposed approach we conduct a number of
experiments: (1) action recognition using different action subsets,
(2) a comparison of the 3D-MC and HMC descriptors,
(3) evaluation of the motion detection, and (4) performance
evaluation with a variable number of camera views used for
training and testing of the system.

The i3DPost Multi-View Human Action Dataset. We evaluate
our approach using the publicly available i3DPost
Multi-View Human Action Dataset [5]. The dataset consists of 8
actors performing 10 different actions, where 6 are single
actions: walk, run, jump, bend, hand-wave and jump-in-place,
and 4 are combined actions: sit-stand-up, run-fall, walk-sit and
run-jump-walk. Additionally, the dataset also contains 2
interactions: handshake and pull, and 6 basic facial expressions,
which will not be considered in our evaluation. The subjects
have different body sizes and clothing, and are of different sex and
nationalities. The multi-view videos have been recorded by a
setup of 8 calibrated and synchronized cameras in high definition
resolution (1920 x 1080), resulting in a total of 640 videos
(excluding videos of interactions and facial expressions). For
each video frame a 3D mesh model of relatively high detail
level (20,000-40,000 vertices and 40,000-80,000 triangles)
of the actor and the associated camera calibration parameters
are available. The mesh models were reconstructed using a
global optimization method proposed by Starck and Hilton [7].
Figure 4 shows multi-view actor/action and 3D mesh model
examples from the i3DPost dataset.

3D Human Action Recognition. For the first test we use the
data available for all 8 camera views. We perform leave-one-out
cross validation; hence, we use one actor for testing, while
the system is trained using the rest of the dataset. Table I
presents the results of our approach using the 3D-MC and
HMC descriptors in comparison to Gkalelis et al. [8]. The
results show comparable performance for the 3D-MC and
HMC descriptors, but with a slightly better overall performance
using 3D-MC. For the full action set of 10 actions,
the accuracy for 3D-MC and HMC are 80.00% and 76.25%,
respectively. The confusion matrices for this test are shown in







TABLE I

Recognition results (%) for different sets of actions using the 3D-MC and HMC descriptors compared to Gkalelis et al. [8].

Method         10 actions   6 single actions   4 combined actions   9 actions   5 single actions   4 single actions
3D-MC            80.00           89.58               84.38            84.72          97.50             100.00
3D-MC-mean       77.50           87.50               81.25            83.33          95.00             100.00
HMC              76.25           85.42               87.50            81.94          95.00             100.00
HMC-mean         68.75           79.17               84.38            73.61          90.00              93.75
Gkalelis [8]       -               -                   -                -            90.00

Fig. 5. Confusion matrices for all 10 actions using (a) 3D-MC and (b) HMC descriptors.


Figure 5. As can be seen, the main errors for both descriptors 
occur due to confusion between single actions (walk and run) 
and combined actions, which consist of the same single actions 
(walk-sit and run-jump-walk). Additionally, for HMC there
is some confusion between bend and sit-stand-up, which are
very similar actions. Furthermore, there is confusion between
the two single actions: walk and run, and the two combined
actions: walk-sit and run-jump-walk, respectively. These errors
possibly result from a combination of descriptor normalization 
and a relatively coarse division of the descriptors. While the 
normalization incorporates robustness to performing speed 
variations of different actors, it reduces the discriminative 
power to distinguish between similar movements, which are 
characterized by the velocity, like walk and run. Combined 
with a coarse division of the descriptors, the representations 
might not be descriptive enough to capture the difference 
of these actions. If we exclude the run action we obtain 
approximately a 5% increase in the recognition rates. 
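
The leave-one-actor-out protocol behind these figures can be sketched as follows. This is an illustrative outline under the assumption that each sequence is summarized by a single accumulated descriptor vector; the function name and the pluggable `score` argument (any similarity measure where higher means more similar) are hypothetical.

```python
import numpy as np

def leave_one_actor_out(descriptors, labels, actors, score):
    """Fraction of sequences classified correctly with one actor held out."""
    descriptors = np.asarray(descriptors, dtype=float)
    labels = np.asarray(labels)
    actors = np.asarray(actors)
    correct = 0
    for held_out in np.unique(actors):
        train = actors != held_out
        # One reference descriptor per class: the mean of its training samples.
        refs = {a: descriptors[train & (labels == a)].mean(axis=0)
                for a in np.unique(labels[train])}
        for d, y in zip(descriptors[~train], labels[~train]):
            pred = max(refs, key=lambda a: score(d, refs[a]))
            correct += int(pred == y)
    return correct / len(labels)
```

With 8 actors and 10 actions there are 80 test sequences in total, so an accuracy of 80.00% corresponds to 64 correctly classified sequences.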

Separating Single and Combined Actions. We now divide
the dataset into 6 single and 4 combined actions and recognize
each action set separately. For the single action set the
accuracies of 3D-MC and HMC are 89.58% and 85.42%, and for
the combined actions 84.38% and 87.50%, respectively. The
errors are similar to the confusions reported above, where the
single actions: walk and run, and the combined actions:
walk-sit and run-jump-walk are confused, respectively. It should be
noted that the combined actions are more challenging than the
single actions.

To compare our results to Gkalelis et al. [8], who report an 
accuracy of 90.00% for 5 of the single actions, we exclude 
one single action (run) and recognize 97.50% and 95.00% 
of the actions correctly. By excluding two single actions 
we achieve a 100.00% accuracy for both descriptors. These 
results are consistent with our expectations and the comparison 
of the shape histogram (3D shape context) and the spherical 
harmonic representation (harmonic shape context) reported by 
Huang et al. [18], where the shape histogram also performs 
slightly better than spherical harmonics. The results for 3D-MC
are in general slightly better (≈4%) than HMC, since
3D-MC is made view-independent by vertical rotation, and
the best match is chosen. In contrast, HMC is a view-invariant
representation, implicitly accounting for changing view-points.
Furthermore, it is an approximation of the 3D-MC descriptor
by decomposing the representation into spherical harmonic
basis functions within a certain bandwidth. Hence, the
classification of HMC is not only less computationally expensive,
but the dimensionality of the descriptor can also be controlled
and reduced by the chosen bandwidth.

Evaluation of 3D Motion Detection. We evaluate the quality 
of the estimated motion vector fields by comparing our method 
to fuse 3D optical flow from multiple views and the “mean 
3D optical flow” determined by the average 3D flow for each 
view (see Figure 2). For this purpose we conduct a test using 
all 8 camera views and a variable number of actions, and 





















Fig. 7. Plots of the recognition accuracy as a function of the number of applied camera views. (a) 3D-MC and (b) HMC present results for a variable number of views and
actions; (c) 3D-MC and (d) HMC show results using a variable number of views for training and testing of the system, separately.



Fig. 6. Plots of the recognition accuracy as a function of the number of 
classified actions. 

compare the recognition accuracy for the two descriptors using 
our method (3D-MC and HMC) and the “mean 3D optical 
flow” (3D-MC-mean and HMC-mean). The results are shown 
in Table I and Figure 6. An overall increase in the performance 
can be observed (up to 8.3%) using our method, which 
validates the robustness of our approach to estimate motion 
vector fields for rich 3D motion description. It should be noted 
that the descriptors incorporate robustness to erroneous motion 
vectors implicitly. 

Variable Number of Camera Views. The main objective of 
this evaluation is to test the influence of the number of views 
(1-8) used in the multi-view camera system, and how it affects 
the action recognition accuracy. First, we test the number of 
applied views versus the number of actions to be recognized. 
Figure 7.a and 7.b present plots of the results using the 3D- 
MC and HMC descriptors. Most important to notice is the 
significant performance increase (up to 13.9%), which occurs 
when going from one single view to combining two views. 
The influence is especially noticeable, when discriminating 
between a larger number of actions, which evidently relies 
on the quality of the extracted motion used for description. 
When introducing more views the performance improves more 
moderately, and at 3-4 views it stabilizes. Note that, by using 4
views 3D-MC recognizes 5 single actions perfectly (100.00%
accuracy). Additionally, HMC seems to be more sensitive to 
the number of applied views than 3D-MC. 

Next, we perform action recognition using all 10 actions 
but with a variable number of views to train and test the 
system, separately. The results are shown in Figure 7.c and 7.d. 
Here, the performance boost (16.3%), when fusing two views, 


is even more noticeable than in the first test case. Similar
behavior is observed when applying more than two views.
However, the 3D-MC descriptor already stabilizes at 2 testing
views, while the training phase only stabilizes at 4 views. In
contrast, the HMC descriptor stabilizes more slowly at a higher
number of views (4-6 views). Notice how 3D-MC gives a
higher accuracy (82.50%) using 5-6 training and 3-4 testing
views than for all 8 views.

VI. Conclusion 

In this paper we have presented an approach for human 
action recognition in 3D for multi-view camera systems. 
One of the main concepts of our approach is the proposed 
estimation of 3D optical flow, and how it is combined into 
motion vector fields by considering the significance of local 
motion and its reliability. This novel technique to derive 3D
motion information has been shown to be robust and produces
consistent and homogeneous vector fields with few outliers
and erroneous motion vectors. We have applied and compared
two 3D motion descriptors (3D-MC and HMC) and shown 
promising results for the i3DPost Multi-View Human Action 
Dataset, within an accuracy range of 76-100%, using all 
10 actions and by separating the action datasets into single 
and combined action sets. Furthermore, we have evaluated 
the performance of the 3D-MC and HMC descriptors for 
a variable number of actions and camera views used for 
training and testing of the system. 